Abstract

Background: DNA methylation occurs in the context of a CpG dinucleotide. It is an important epigenetic
modification that can be inherited through cell division. The two major types of methylation change are
hypomethylation and hypermethylation. Unique methylation patterns have been shown to exist in diseases
including various types of cancer. DNA methylation analysis promises to become a powerful tool in cancer
diagnosis, treatment and prognostication. Large-scale methylation arrays are now available for studying
methylation genome-wide. The Illumina methylation platform simultaneously measures cytosine methylation at
more than 1500 CpG sites associated with over 800 cancer-related genes. Cluster analysis is often used to identify
DNA methylation subgroups for prognosis and diagnosis. However, because methylation data have unique
non-Gaussian characteristics, traditional clustering methods may not be appropriate for DNA methylation data,
and determining the optimal number of clusters remains problematic.

Method: A Dirichlet process beta mixture model (DPBMM) is proposed that models the DNA methylation
expressions as an infinite mixture of beta distributions. The model allows automatic learning of the relevant
parameters such as the cluster mixing proportions, the parameters of the beta distribution for each cluster, and
especially the number of potential clusters. Since the model is high dimensional and analytically intractable, we
propose a Gibbs sampling "no-gaps" solution for computing the posterior distributions and hence the estimates of
the parameters.

Result: The proposed algorithm was tested on simulated data as well as methylation data from 55 glioblastoma
multiforme (GBM) brain tissue samples. To reduce the computational burden due to the high data dimensionality, a
dimension reduction method is adopted. The two GBM clusters yielded by DPBMM, obtained with different
numbers of loci, are statistically significant (P-value < 0.1), whereas hierarchical clustering does not yield
statistically significant clusters.



Background 

DNA methylation profiles have become an alternative
molecular footprint for classification. DNA methylation occurs in the
context of a CpG dinucleotide. It is an important epigenetic
modification, which can be inherited through cell division. 
In this chemical modification of the cytosine nucleotide, 
the 5-carbon position is enzymatically modified by the 
addition of a methyl group such that cytosines can occur 
in a methylated or unmethylated state. CpG islands are 
usually not methylated in normal tissues but frequently
become hypermethylated in cancer [1]. This hypermethy- 
lation is associated with gene silencing [2] and plays an 
important role in the inactivation of tumor suppressor 
genes. Most CpGs or CpG regions have been found to 
have a bimodal distribution of methylation profiles, either 
hypomethylated or hypermethylated [3] . Unique methyla- 
tion patterns have been shown to exist in diseases includ- 
ing various types of cancer [4] . DNA methylation analysis 
promises to become a powerful tool in cancer diagnosis, 
with possible applications to the choice of treatment and 
prognostication. High-throughput methylation profiling
technology has been developed to survey the methylation
status of more than 1500 CpG sites across a large collection
of specifically targeted cancer genes. Studying
how methylation profiles can be used to distinguish
different tumor subtypes has been a focus of current
cancer research. However, most existing algorithms for
methylation data work at the sequence level, and the exact levels
of methylation expression are not yet fully considered.

To this end, clustering analysis is often used to identify 
methylation subgroups that are distinct from one another 
in data [5,6]. However, the DNA methylation data presents 
unique challenges. First, it is not appropriate to cluster 
DNA methylation expressions using traditional clustering 
methods. Traditional clustering algorithms such as k-means
rest on Gaussian mixture model (GMM) assumptions.
In a GMM, the individual data points are assumed to
follow a multivariate Gaussian distribution, and thus the
distance between two points can conveniently be evaluated by the
Euclidean distance. However, since "beta" values from a
DNA methylation array represent the percentage of
methylated alleles and lie between 0 and 1, the traditional GMM
is no longer appropriate. Instead, a mixture of beta
distributions [7,8] would be a more accurate model. Second, a
model selection process is often needed in clustering to
determine the number of clusters, making the clustering
analysis more complicated. A predefined number of clusters
(or model) is required in mixture distribution
based methods (such as k-means). Since different numbers
of clusters will yield different clustering results, a model
selection process is desirable to determine the best number
of clusters. Model selection is a very difficult problem
whose optimal solution is of exponential complexity.
Popular suboptimal solutions that have been proposed
include minimum description length (MDL) and the
Bayesian information criterion (BIC). Although computationally
efficient, these methods can fail when clusters
are not well separated. Recently proposed nonparametric
Bayesian methods, including the Dirichlet process (DP),
provide an avenue that can lead to a better solution.

In response to the aforementioned limitations, we
propose here a nonparametric Dirichlet process beta
mixture model (DPBMM) for clustering DNA
methylation expression profiles produced by the Illumina
Infinium BeadChip. DPBMM uses a Dirichlet process
mixture to place a prior [9] on cluster assignment
and thus enables automatic determination of the optimal
number of clusters. To perform the analytically intractable
learning, an algorithm based on Gibbs sampling and
"no-gaps" sampling is developed to infer all the relevant
variables. The proposed DPBMM method builds an
infinite beta mixture model to describe DNA methylation
data, which is different from the finite beta mixture
model in [8]. We present a simulation study comparing
its properties to those of the recursively partitioned mixture
model (RPMM) employing the Bayesian information criterion (BIC)
in [8]. The results demonstrate the better performance
of our proposed method. Finally, we apply DPBMM
to the methylation array data obtained from 55 glioblastoma
multiforme (GBM) brain tissue samples.

Methods 

Problem formulation 

Modeling DNA methylation profiles with a beta mixture distribution

For a two-color hybridization based array such as the Illumina
Infinium array, the measurements are associated with the
percentage of methylated alleles. These measurements are called
"beta" values because they can be described by a mixture of
beta distributions [7,10]. Since the distribution of "beta"
values shows bimodality [11], each beta distribution component
in the mixture model should be hump-shaped (unimodal), which
means that each component should be
equipped with large parameters, as shown in Figure 1.

Consider the problem of clustering $n$ independent
DNA methylation samples. Let $X = \{X_1, X_2, \ldots, X_n\}$ be the
DNA methylation expressions for the $n$ samples. For each
sample $i$, let $X_i = \{x_{i1}, x_{i2}, \ldots, x_{iL}\}$ be a vector of $L$ continuous
outcomes falling between zero and one. Suppose there
exist a total of $K$ clusters and sample $i$ belongs to cluster
class $c_i \in \{1, \ldots, K\}$. Conditional on class membership, say
$k$, each outcome $x_{il}$ can be viewed as an independent
identically distributed variable from a beta distribution
with parameters $\alpha_{kl}$ and $\beta_{kl}$:

$$f(x_{il} \mid \alpha_{kl}, \beta_{kl}, c_i = k) = \frac{x_{il}^{\alpha_{kl}-1}(1 - x_{il})^{\beta_{kl}-1}}{B(\alpha_{kl}, \beta_{kl})} \tag{1}$$

where $B(\alpha, \beta) = \int_0^1 x^{\alpha-1}(1 - x)^{\beta-1}\,dx$ stands for the
Beta function. Then DNA methylation sample $X_i$ can be
modeled by (2).

$$f(X_i \mid \theta_i) = \sum_{k=1}^{K} \pi_k \prod_{l=1}^{L} \frac{x_{il}^{\alpha_{kl}-1}(1 - x_{il})^{\beta_{kl}-1}}{B(\alpha_{kl}, \beta_{kl})} \tag{2}$$

where $\theta_i = \{\alpha_{kl}, \beta_{kl}, \forall l\}$. With the limitation of large
parameters for the beta distribution components, $\alpha_{kl} > 1$ and $\beta_{kl} > 1$.
Note that due to clustering in samples, $\theta_i$ and $\theta_j$ for $i \neq j$
may be equal. The $\pi_k$'s represent the cluster proportions and
$\sum_{k=1}^{K} \pi_k = 1$. In reality, the total cluster number $K$ is
not known a priori. We discuss next a model based on the
Dirichlet process to address this difficulty.

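
As an illustration of (1) and (2), the following minimal Python sketch evaluates the log-density of a single sample under a finite beta mixture. The function names and the list-based data layout are assumptions made for this example and are not taken from the authors' MATLAB code; the mixture is evaluated in log space only for numerical stability.

```python
import math

def log_beta_fn(a, b):
    """log B(a, b), computed from log-gamma functions."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def cluster_loglik(x, alpha, beta):
    """Sum over loci of the log beta density in (1) for one cluster."""
    return sum((a - 1.0) * math.log(v) + (b - 1.0) * math.log1p(-v) - log_beta_fn(a, b)
               for v, a, b in zip(x, alpha, beta))

def beta_mixture_loglik(x, alphas, betas, pis):
    """Log of the mixture density (2) for one sample x with 0 < x_l < 1.

    alphas[k][l], betas[k][l]: parameters of cluster k at locus l.
    pis[k]: mixing proportion of cluster k.
    """
    terms = [math.log(p) + cluster_loglik(x, a, b)
             for p, a, b in zip(pis, alphas, betas)]
    m = max(terms)
    # log-sum-exp over clusters
    return m + math.log(sum(math.exp(t - m) for t in terms))
```

For example, with a single cluster and $\alpha = 2$, $\beta = 1$ at every locus, the density of each outcome is $2x$, so a sample (0.5, 0.25) has density $1 \times 0.5 = 0.5$.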
Dirichlet process mixture model 

The Dirichlet process is a nonparametric extension of the
original Dirichlet distribution. Let $x_i$ be a random sample
from a distribution $f$ with parameters $\theta_i$. In a Bayesian
formulation, the model for parameter $\theta_i$ can be defined as

$$\begin{aligned}
x_i \mid \theta_i &\sim f(\theta_i) \\
\theta_i \mid G &\sim G
\end{aligned} \tag{3}$$

where $G$ is the prior distribution of $\theta_i$. It is not always
realistic to assume that $G$ is of a known form, and
nonparametric Bayesian models including the Dirichlet
process (DP) have been proposed to address this problem. Now,
instead of defining a parametric form for $G$, $G$ is
assumed to be a draw from a Dirichlet process with a
base distribution $G_0$ and a precision parameter $\tau$ [12].
The model for the Bayesian estimation is also shown in
Figure 2 following the principles of graphical models. It
can also be written as (4) with a DP prior.

$$\begin{aligned}
X_i \mid \theta_i &\sim f(\theta_i) \\
\theta_i \mid G &\sim G \\
G \mid \tau, G_0 &\sim \mathrm{DP}(\tau, G_0)
\end{aligned} \tag{4}$$

where $G_0$ is such that $E[G] = G_0$ and has a parametric
form, and $\tau$ measures the strength of belief in $G_0$. DP
mixtures (DPM) have been proposed to model the clustering
effect in data. Compared with other clustering models,
DPM is very attractive because it allows the cluster
number $K$ to be infinite a priori and to be learned from the data.



To capture the clustering nature of DNA methylation
samples, a beta mixture model with infinite classes can
be built with DPM. Let $\theta_i = \{\alpha_i, \beta_i\}$ be the set of
parameters for each sample, and note that some of them
may be equal. In DPM models, each $\theta_i$ is marginally
sampled from $G_0$, and with positive probability some of
the $\theta_i$ are identical due to the discreteness of the random
measure $G$. Therefore the new value of $\theta_i$ can
either be one of the $\theta_l$'s ($l \neq i$) or a new draw
from $G_0$. Letting $K$ in (2) be $\infty$, we assume a DPBMM for the
DNA methylation array.

Inference 

Let $\Phi = \{\phi_1, \phi_2, \ldots, \phi_K\}$ denote the set of distinct $\theta_i$'s,
where $K$ is the number of distinct elements of $\theta_1, \ldots, \theta_n$.
Let $s = \{s_1, \ldots, s_n\}$ denote the cluster assignment vector, that
is, $s_i = l$ if and only if $\theta_i = \phi_l$. Then $\theta = \{\theta_i : i = 1, \ldots, n\}$
can be reparameterized as $\{\phi_1, \ldots, \phi_K, s_1, \ldots, s_n\}$. Let $n_i$,
$i = 1, \ldots, K$, be the number of elements $s_l$ equal to $i$. Let the
subscript "$-i$" stand for all the variables except the $i$-th one.
The goal from a Bayesian perspective is to calculate the
posterior distribution of the unknown parameters $\{\theta, \pi, \tau\}$.
However, the analytical expression is intractable, and we
instead develop a Gibbs sampling solution to obtain random
samples from the posterior distribution. The key for
Gibbs sampling is to derive the conditional posterior
distributions of the unknown parameters. Due to the
constraints on $\alpha$ and $\beta$, we first re-parameterize $\alpha$ as $L_\alpha$ by
$\alpha = \exp(|L_\alpha|)$ and $\beta$ as $L_\beta$ by $\beta = \exp(|L_\beta|)$. Thus, we only need
to sample in the range $(-\infty, \infty)$ for $L_\alpha$ and $L_\beta$, and the
transformed parameters satisfy $\alpha > 1$ and $\beta > 1$. We can then specify $G_0$ as
$G_0(\alpha, \beta) = \mathcal{N}(0, \sigma_\alpha^2)\,\mathcal{N}(0, \sigma_\beta^2)$, where $\mathcal{N}(\mu, \sigma^2)$
represents the Gaussian distribution with mean $\mu$ and variance
$\sigma^2$ [13]. The distribution of the cluster proportions $\pi$
given the cluster sizes is the Dirichlet distribution

$$\pi \sim \mathrm{Dir}(n_1 + \tau/K, \ldots, n_K + \tau/K). \tag{5}$$

Figure 2 Graphical model. The model for the Bayesian estimation is built following the principles of graphical models.
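
The re-parameterization and the Dirichlet draw in (5) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function names are hypothetical, `sigma_alpha` and `sigma_beta` are standard deviations, and the Dirichlet draw uses the standard construction of normalizing independent gamma variates.

```python
import math
import random

def draw_from_base(n_loci, sigma_alpha, sigma_beta, rng):
    """Draw one cluster's (alpha, beta) vectors from G_0.

    L_alpha and L_beta are Gaussian with mean zero; the transform
    exp(|L|) maps them to shape parameters that are at least one.
    """
    alpha = [math.exp(abs(rng.gauss(0.0, sigma_alpha))) for _ in range(n_loci)]
    beta = [math.exp(abs(rng.gauss(0.0, sigma_beta))) for _ in range(n_loci)]
    return alpha, beta

def draw_proportions(counts, tau, rng):
    """Draw pi ~ Dir(n_1 + tau/K, ..., n_K + tau/K) as in (5)."""
    k = len(counts)
    g = [rng.gammavariate(n + tau / k, 1.0) for n in counts]
    total = sum(g)
    return [v / total for v in g]

rng = random.Random(0)
alpha, beta = draw_from_base(5, 1.0, 1.0, rng)
pi = draw_proportions([3, 2, 5], 1.0, rng)
```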



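There are several useful representations of a Dirichlet
process, such as the Chinese restaurant process (CRP) [14,15], the
stick-breaking construction [16], and the Polya urn formulation
[17,18]. Blackwell showed that Dirichlet processes are
discrete, as they consist of countably infinite point
probability masses [19]. Escobar and West [20] first showed
that Markov chain Monte Carlo (MCMC) techniques,
specifically Gibbs sampling, could be used for posterior
density estimation if the Blackwell-MacQueen Polya urn
formulation of the Dirichlet process is used. Based on the
generalized Polya urn scheme, the conditional prior
distributions $s_i \mid s_1, \ldots, s_{i-1}$, $i = 1, \ldots, n$, and
$\theta_i \mid \theta_1, \ldots, \theta_{i-1}$ have the forms (6) and (7).

$$\begin{aligned}
P(s_1 = 1) &= 1, \\
P(s_i = l \mid s_1, \ldots, s_{i-1}) &= \frac{n_l}{\tau + i - 1}, \quad l = 1, \ldots, k_i, \\
P(s_i = k_i + 1 \mid s_1, \ldots, s_{i-1}) &= \frac{\tau}{\tau + i - 1},
\end{aligned} \tag{6}$$

where $k_i$ denotes the number of distinct clusters among $s_1, \ldots, s_{i-1}$
and $n_l$ the number of those assignments equal to $l$, and,

$$\begin{aligned}
\theta_1 &\sim G_0(\theta_1), \\
\theta_i \mid \theta_1, \ldots, \theta_{i-1} &\sim \frac{\tau}{\tau + i - 1}\, G_0(\theta_i) + \sum_{l=1}^{K} \frac{1}{\tau + i - 1}\, n_l\, \delta_{\phi_l}(\theta_i), \quad \text{for } i > 1.
\end{aligned} \tag{7}$$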
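
To make the Polya urn prior in (6) concrete, the following minimal Python sketch draws a sequence of cluster assignments from it. The function name and the 1-based labels are choices made for this illustration only; the sketch samples from the prior and does not involve any data.

```python
import random

def polya_urn_assignments(n, tau, rng):
    """Draw s_1, ..., s_n sequentially from the Polya urn prior in (6).

    Returns the assignments (labels 1, 2, ...) and the cluster sizes.
    """
    s, counts = [], []
    for i in range(1, n + 1):
        # An existing cluster l has weight n_l and a new cluster has
        # weight tau; the weights sum to tau + i - 1.
        u = rng.random() * (tau + i - 1)
        acc = 0.0
        label = len(counts) + 1  # default: open a new cluster
        for l, n_l in enumerate(counts, start=1):
            acc += n_l
            if u < acc:
                label = l
                break
        if label == len(counts) + 1:
            counts.append(0)
        counts[label - 1] += 1
        s.append(label)
    return s, counts

s, counts = polya_urn_assignments(50, 1.0, random.Random(0))
```

Larger values of the precision parameter make new clusters more likely, so the number of occupied clusters tends to grow with it.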



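Then the conditional posterior distribution for sampling
$\theta_i$ has the form

$$\begin{aligned}
p(\theta_i \mid \theta_{-i}, s_{-i}, X) &\propto q_{i,0}\, G_i(\theta_i) + \sum_{l=1,\, l \neq i}^{n} q_{i,l}\, \delta_{\theta_l}(\theta_i) \\
&= q_{i,0}\, G_i(\theta_i) + \sum_{l=1}^{K} n_{-i,l}\, q_{i,l}\, \delta_{\phi_l}(\theta_i).
\end{aligned} \tag{8}$$

Thus the conditional posterior distribution for sampling
$\phi_i$ has the form

$$\begin{aligned}
p(\phi_i \mid \phi_{-i}, s, X, \pi) &\propto p(X_{m : s_m = s_i} \mid \phi, s, \pi)\, p(\phi_i \mid s, \pi) \\
&= G_0 \prod_{m : s_m = s_i} \prod_{l=1}^{L} \frac{x_{ml}^{\alpha_{kl}-1}(1 - x_{ml})^{\beta_{kl}-1}}{B(\alpha_{kl}, \beta_{kl})}.
\end{aligned} \tag{9}$$

It is obvious that $G_0$ is not conjugate with $f$, so the
integral $q_{i,0}$ cannot be evaluated analytically, and drawing
samples from $G_i$ is also extremely challenging [21]. To
overcome the difficulty, we adopt the "no-gaps" algorithm
proposed in [22] to enable sampling from (8).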

As to $\tau$, it is useful to choose a weakly informative
prior in many applications. If $\tau$ is assigned a gamma
prior, its posterior becomes a simple function of $K$, and
samples are easily drawn via an auxiliary variable
method. For the convenience of sampling, we adopt
$\tau \sim \mathrm{Gamma}(a, b)$ as the prior [9,20].
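
The auxiliary variable update for the precision parameter can be sketched as follows. This is a minimal illustration of the scheme of Escobar and West [20], assuming that Gamma(a, b) is parameterized by shape a and rate b (the text does not state the parameterization); the function name is hypothetical.

```python
import math
import random

def resample_tau(tau, k, n, a, b, rng):
    """One auxiliary-variable update of tau given k clusters and n samples.

    Assumes the prior tau ~ Gamma(shape=a, rate=b).
    """
    # Auxiliary variable eta | tau ~ Beta(tau + 1, n)
    eta = rng.betavariate(tau + 1.0, n)
    rate = b - math.log(eta)
    # tau | eta, k is a two-component gamma mixture; odds = pi / (1 - pi)
    odds = (a + k - 1.0) / (n * rate)
    shape = a + k if rng.random() < odds / (1.0 + odds) else a + k - 1.0
    # random.gammavariate takes a scale parameter, i.e. 1 / rate
    return rng.gammavariate(shape, 1.0 / rate)

rng = random.Random(0)
tau = 1.0
for _ in range(10):
    tau = resample_tau(tau, k=4, n=100, a=1.0, b=1.0, rng=rng)
```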
The final Gibbs sampler can be summarized by
the following steps:

Gibbs sampling for DPBMM

Iterate the following steps; for the t-th iteration:

1. For each sample $i$, re-sample $s_i$ according to (6) if
$n_{s_i} > 1$. In this case $k_{-i} = K$. If $n_{s_i} = 1$, then with
probability $1 - 1/K$ leave $s_i$ unchanged. With probability
$1/K$ rearrange $s$ such that $s_i = K$, then re-sample $s_i$
according to (6). In this case $k_{-i} = K - 1$.

2. For $i = 1, \ldots, K$, the posterior distribution for $\phi_i$
has the form (9).

For $i = K + 1, \ldots, n$, both the prior and the posterior
distribution for $\phi_i$ are $G_0$.

3. Sample $\pi$ following (5) with $n_k = \sum_{i=1}^{n} \delta_k(s_i)$.

4. Based on Step 1, we can get the value of $K$; then
sample $\tau \mid K, n$, where $\tau \sim \mathrm{Gamma}(a, b)$.

Due to the large number of parameters, the initial values
for the parameters $\sigma_\alpha^2$ and $\sigma_\beta^2$ should be chosen carefully.
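
Step 1 can be sketched in Python as below. This is a hedged illustration of a "no-gaps" assignment update as it is commonly described for non-conjugate Dirichlet process mixtures, not the authors' implementation. It assumes that the re-sampling weights combine the cluster sizes with the beta likelihood of the sample, and that the candidate new cluster receives weight $\tau/(k_{-i} + 1)$; labels are 0-based, and the names `loglik`, `draw_phi` and `resample_assignment` are hypothetical.

```python
import math
import random

def loglik(x, phi):
    """Log-likelihood of sample x under cluster parameters phi = (alphas, betas)."""
    alphas, betas = phi
    return sum((a - 1) * math.log(v) + (b - 1) * math.log1p(-v)
               - (math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b))
               for v, a, b in zip(x, alphas, betas))

def draw_phi(n_loci, sigma, rng):
    """Draw cluster parameters from G_0 via alpha = exp(|L_alpha|), beta = exp(|L_beta|)."""
    return ([math.exp(abs(rng.gauss(0, sigma))) for _ in range(n_loci)],
            [math.exp(abs(rng.gauss(0, sigma))) for _ in range(n_loci)])

def resample_assignment(i, s, phis, X, tau, sigma, rng):
    """Update s[i] in place; phis grows or shrinks so that labels have no gaps."""
    K = len(phis)
    counts = [s.count(k) for k in range(K)]
    if counts[s[i]] == 1:
        if rng.random() < 1.0 - 1.0 / K:
            return  # leave the singleton cluster unchanged
        # Swap labels so that sample i's cluster carries the last label.
        old, last = s[i], K - 1
        for m in range(len(s)):
            if s[m] == old:
                s[m] = last
            elif s[m] == last:
                s[m] = old
        phis[old], phis[last] = phis[last], phis[old]
        counts[old], counts[last] = counts[last], counts[old]
        k_minus = K - 1
    else:
        k_minus = K
        phis.append(draw_phi(len(X[i]), sigma, rng))  # candidate new cluster
    counts[s[i]] -= 1
    logw = [math.log(counts[l]) + loglik(X[i], phis[l]) for l in range(k_minus)]
    logw.append(math.log(tau / (k_minus + 1)) + loglik(X[i], phis[k_minus]))
    mx = max(logw)
    w = [math.exp(v - mx) for v in logw]
    u, acc, choice = rng.random() * sum(w), 0.0, k_minus
    for l, wl in enumerate(w):
        acc += wl
        if u < acc:
            choice = l
            break
    s[i] = choice
    if choice != k_minus:
        phis.pop()  # the last cluster is empty or was an unused candidate
```

After every call the labels in `s` run from 0 to `len(phis) - 1` with no empty cluster, which is the invariant that gives the algorithm its name.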

Results 

Test on simulated data 

We conducted simulations to test our proposed method.
For the first case, the simulated data set is generated
based on the model described in (2) with $K = 4$. The
simulated data set consists of 100 samples, each having
200 continuous responses lying in the unit interval. The
occurrence probabilities of the clusters are set to {0.2, 0.3,
0.2, 0.3}. For each cluster, the parameters $L_\alpha$, $L_\beta$ related to the
beta distribution in the model are generated randomly
from Gaussian distributions with zero means and different
variances. In order to systematically evaluate the
clustering performance, the F metric that combines BCubed
overall precision and recall [23] was implemented as
suggested in [24]. Let $\{c\}$ represent the true cluster labels of the
samples and $\{s\}$ the cluster assignments produced by the
clustering method. The correctness of the relation between
samples $i$ and $i'$ is defined as $Ct(i, i')$ based on $\{c\}$ and $\{s\}$:

$$Ct(i, i') = \begin{cases} 1 & \text{if } c_i = c_{i'} \Leftrightarrow s_i = s_{i'}, \\ 0 & \text{otherwise.} \end{cases} \tag{10}$$

The overall precision $P$ and recall $R$ are defined as

$$\begin{aligned}
R &= \mathrm{Avg}_i\big[\mathrm{Avg}_{i' : c_i = c_{i'}}[Ct(i, i')]\big], \\
P &= \mathrm{Avg}_i\big[\mathrm{Avg}_{i' : s_i = s_{i'}}[Ct(i, i')]\big].
\end{aligned} \tag{11}$$

The F metric is used to evaluate the clustering result by
combining the $P$ and $R$ metrics:

$$F = \frac{1}{0.5/P + (1 - 0.5)/R} \tag{12}$$
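
The evaluation metrics in (10)-(12) translate directly into code. The following minimal Python sketch, with a hypothetical function name, computes the BCubed precision, recall and equally weighted F metric from the true labels and the cluster assignments.

```python
def bcubed(c, s):
    """Return (P, R, F) for true labels c and cluster assignments s."""
    n = len(c)

    def correct(i, j):
        # Ct(i, i') = 1 when the pair is grouped the same way in c and in s.
        return 1.0 if (c[i] == c[j]) == (s[i] == s[j]) else 0.0

    prec = rec = 0.0
    for i in range(n):
        same_s = [j for j in range(n) if s[j] == s[i]]  # includes i itself
        same_c = [j for j in range(n) if c[j] == c[i]]
        prec += sum(correct(i, j) for j in same_s) / len(same_s)
        rec += sum(correct(i, j) for j in same_c) / len(same_c)
    p, r = prec / n, rec / n
    f = 1.0 / (0.5 / p + (1.0 - 0.5) / r)
    return p, r, f
```

For instance, merging two true clusters of two samples each into a single cluster gives $P = 0.5$, $R = 1$ and $F = 2/3$, while any relabeling of a perfect clustering gives $P = R = F = 1$.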



Figure 3(a) illustrates the sampled number of clusters
in each Gibbs sampling iteration for a single run of
DPBMM clustering. After a "burn-in" stage of 300 iterations,
the number of clusters stays at four. The recovered
cluster proportions are {0.19, 0.31, 0.19, 0.31}. Figure 3(b-d)
show that over 2000 runs of DPBMM clustering, the F metric
reaches one in most runs.

For our second case, we used two simulated data sets
from [8]. The data set of Case I consists of 100 subjects
and mimics real methylation data. Each subject has
1413 continuous responses lying in the unit interval. Each
subject was a member of one of five classes, each class
occurring with probability 0.2. The classes were defined by
beta-distribution parameters for each of 1413 methylation
loci that were autosomal and passed quality assurance,
obtained in [8] by fitting a beta model on each locus to one of
five normal-tissue data sets: adult blood, newborn
blood, placenta, lung/pleura, and everything else.
The data set of Case II considered 100 subjects from four
clusters. We compare the performance with the RPMM
method proposed in [8], with the same dimension reduction
method employed: we order all the loci with respect
to variance, and the $J$ most variable loci are considered in
the clustering algorithm. Table 1 and Table 2 summarize
the number of classes found with RPMM and with our
proposed DPBMM for Case I and Case II. For the
cases considered, DPBMM obtained the correct $K$ directly with
$K$ infinite a priori, while RPMM fitted finite mixture
models for a range of possible values and chose the correct
$K$ by the BIC statistic. The F metric vs. recall curve for $J \in \{25,
50\}$ loci for Case I is shown in Figure 4(a). The histogram
of F metric results with $J = 50$ is shown in Figure 4(b). The
F metric vs. recall curve for $J \in \{5, 10\}$ loci for
Case II is shown in Figure 4(c). The histogram of F metric
results with $J = 10$ is shown in Figure 4(d). For the above
two cases, the more loci are considered in
the clustering, the better the clustering performance.
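
The variance-based dimension reduction described above can be sketched as follows. The function names are hypothetical, and the sample variance with an $n - 1$ denominator is an assumption, since the text only states that loci are ordered by variance.

```python
def top_variable_loci(X, J):
    """Return the indices of the J loci with the largest variance across samples.

    X is a list of samples, each a list of L beta values.
    """
    n, L = len(X), len(X[0])
    variances = []
    for l in range(L):
        col = [row[l] for row in X]
        mean = sum(col) / n
        variances.append(sum((v - mean) ** 2 for v in col) / (n - 1))
    order = sorted(range(L), key=lambda l: variances[l], reverse=True)
    return order[:J]

def reduce_samples(X, loci):
    """Keep only the selected loci in every sample."""
    return [[row[l] for l in loci] for row in X]

X = [[0.1, 0.5, 0.9], [0.9, 0.5, 0.8], [0.1, 0.5, 0.9]]
loci = top_variable_loci(X, 2)
X_reduced = reduce_samples(X, loci)
```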

Test on real data 

We then applied our proposed DPBMM clustering to the
GBM methylation microarray dataset in The Cancer
Genome Atlas (TCGA). This dataset consists of 74
patients assayed on the Illumina HumanMethylation450 array.
Samples for DPBMM clustering analysis were selected to
have clinical annotations, which left 55 patients for
consideration. Twenty-seven patients were alive at the
time of last follow-up, whereas twenty-eight patients
experienced disease progression since last follow-up. The
median follow-up time was 198 days (range, 2-953 days).
Each sample includes up to 485,577 CpG dinucleotides
spanning gene-associated elements as well as intergenic
regions. The associated detection P-value reported
together with the methylation expression data is used as a
quality control measure of probe performance. Following
the probe exclusion method in [25], the probes with
detection P-values >0.01 in >10% of the samples are
excluded from further consideration.

Figure 3 Clustering evaluation on simulation data set. The result is based on the simulated data with 4 dimensions.
Figure 3(a) shows the number of clusters k in 2000 MCMC iterations. Figure 3(b) shows the overall precision vs.
overall recall for 2000 runs of DPBMM. The overall precision almost always stays at 1. Figure 3(c) shows the
F metric vs. recall curve. Figure 3(d) shows the histogram of F metric results for 2000 runs of DPBMM clustering.
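
The probe exclusion rule for the real data (detection P-value > 0.01 in more than 10% of the samples) can be written as a short filter. The function name and the nested-list layout of the detection P-values are assumptions made for this illustration.

```python
def keep_probes(detection_p, p_cut=0.01, frac_cut=0.10):
    """Return the indices of probes that pass quality control.

    detection_p[i][j] is the detection P-value of probe j in sample i.
    A probe is excluded when more than frac_cut of the samples have a
    detection P-value above p_cut.
    """
    n = len(detection_p)
    keep = []
    for j in range(len(detection_p[0])):
        bad = sum(1 for i in range(n) if detection_p[i][j] > p_cut)
        if bad / n <= frac_cut:
            keep.append(j)
    return keep
```

A probe that fails in exactly 10% of the samples is retained under this reading of the rule, because the threshold in the text is a strict inequality.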

Because methylation arrays have few samples and very high
dimension, many loci in the data set have low variance
and may not contribute to clustering, so it is safer
to consider only loci that change significantly [26]. Thus,
those loci with low variance across all 55 samples were
removed from the data sets, as was also done in [8].

Table 1 Number of classes obtained for RPMM and DPBMM applied to simulated data (Case I: 5 classes).

Method   J    Median   Mean   SD
RPMM     25   8        7.7    2.0
         50   5        5.6    1.32
DPBMM    25   5        5.16   0.93
         50   5        5.29   1.43


Removing low-variance loci also made the DPBMM clustering process
computationally more tractable. In this paper, we only consider the
$J \in \{1, 2, \ldots, 20\}$ most variable loci for the DPBMM clustering
method, since the number of samples is only 55. The
selected top 20 variable loci are listed in Table S1 (see
Additional file 1). DPBMM yields two clusters from the
data for most $J$. Kaplan-Meier survival analyses were carried
out based on the clustering results, and the P-values of the
Kaplan-Meier analyses for $J \in \{1, 2, \ldots, 20\}$ are shown
in Table S2 (see Additional file 2). Among these, $J = 11$
gives the best P-value of 0.03. The heatmap for $J = 11$
is shown in Figure 5, and the Kaplan-Meier overall
survival curve is shown in Figure 6. When $J = 11$, the
clusters in the GBM methylation array uncovered by DPBMM
are statistically significant (P-value < 0.1). We also
analyzed the survival of the two clusters uncovered by
hierarchical clustering, but the clusters yielded are not
statistically significant (P-value > 0.1).

Table 2 Number of classes obtained for RPMM and DPBMM applied to simulated data (Case II: 4 classes).

Method   J    Median   Mean   SD
RPMM     5    2        2.0    0.10
         10   2        2.4    2.38
DPBMM    5    7        6.9    1.04
         10   4        4.09   1.60

Figure 4 Clustering evaluation based on different J. Figure 4(a) shows the F metric vs. recall curve of $J \in \{25, 50\}$ loci for Case I. Figure 4(b)
shows the histogram of F metric results with J = 50. Figure 4(c) shows the F metric vs. recall curve of different $J \in \{5, 10\}$ loci for Case II. Figure
4(d) shows the histogram of F metric results with J = 10.

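
The Kaplan-Meier estimate underlying the survival comparison of the clusters can be sketched in a few lines. This minimal Python illustration, with a hypothetical function name, computes the product-limit survival curve for one group; the significance test that produces the reported P-values is not shown.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with an observed event.

    times: follow-up times; events: 1 if the event was observed, 0 if censored.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    surv, curve = 1.0, []
    idx = 0
    while idx < len(data):
        t = data[idx][0]
        deaths = removed = 0
        # Process all subjects sharing the same follow-up time together.
        while idx < len(data) and data[idx][0] == t:
            deaths += data[idx][1]
            removed += 1
            idx += 1
        if deaths > 0:
            surv *= 1.0 - deaths / at_risk
            curve.append((t, surv))
        at_risk -= removed
    return curve

curve = kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1])
```

In this toy example the survival estimate drops to 0.75 at time 1, is unchanged by the censored subject at time 2, and then drops to 0.375 at time 3 and to 0 at time 4.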
Computation time is always an issue for Gibbs sampling
methods. Our computations were carried out on a Linux-based
high-performance computer cluster. Each processing
core is equipped with 2GB RAM. Figure 7 displays
the computation time for the real data study
described above. The more loci are considered for clustering,
the more time the algorithm takes.

Discussion 

We next discuss a few distinct features of DPBMM.
First, in accordance with the fact that "beta" values in
DNA methylation array data fall in the range of zero to
one, we assume mixtures of beta distributions for the
data. The beta distribution can take flexible shapes and can thus
describe data of various types. This is different from
traditional Gaussian mixture model based clustering
methods such as K-means. Second, since most existing
methods cannot determine the number of clusters
automatically, we adopted a Dirichlet process prior for
cluster assignment. This yields a non-conjugate
Dirichlet process beta mixture model, whose parameters
are hard to estimate. A Gibbs sampling and "no-gaps"
sampling solution is developed to overcome this
difficulty. This is different from traditional parametric
methods, whose results also rely on a model parameter,
such as the number of clusters, that is usually determined
in a separate model selection process.

The limitations of the proposed method are mainly as
follows. First, the algorithm is based on Gibbs sampling,
which is a resource-heavy MCMC method, so
the computation time is still heavy. Second,
the model is computationally too slow to apply to
methylation data of genome scale. We need to reduce
the dimensionality to keep DPBMM computationally
affordable.



In the future, it would be interesting to develop more
effective dimension reduction methods for DPBMM. It
would also be interesting to integrate the information
from different data sources such as gene expression and
copy number variation into one model for cluster
analysis.



Figure 6 Kaplan-Meier estimate of survival analysis based on
uncovered structure of DPBMM method (J = 11). The figure
shows the Kaplan-Meier estimates of the overall survival functions
(time in days, censored observations marked) of the two clusters
(Cluster 1, N = 20; Cluster 2, N = 35) obtained based on
the top 11 variable loci (P-value = 0.03) by DPBMM, which is more
significant than the corresponding result of hierarchical clustering
(P-value = 0.51).
Figure 7 The computation time resulting from the real data
study for $J \in \{1, 2, \ldots, 20\}$. The figure shows the computation time
for different numbers of loci in DPBMM in the real data study. It is carried out
on a Linux-based high-performance computer cluster. Each
processing core is equipped with 2GB RAM. The computation time
increases with the number of loci considered for DPBMM clustering.

Conclusions

An infinite Dirichlet process beta mixture model was
proposed to unveil the latent cluster structure from
Illumina Infinium methylation profiles. By utilizing a
Dirichlet process prior for cluster assignment, the number
of clusters is determined automatically. A Gibbs sampling and
"no-gaps" sampling solution was developed to infer the
relevant parameters. The effectiveness and
validity of the model and the proposed Gibbs sampler
were evaluated on simulated data and on real data. The
results demonstrated that DPBMM could yield the
cluster structure automatically with better accuracy.

Availability 

MATLAB code is available at
https://sites.google.com/site/bdpmmmethy/home.

Additional material 



Additional file 1: Top 20 variable loci (ranked by variance through 
samples) selected from the methylation profiles of the 55 GBM 
samples 

Additional file 2: The number of uncovered clusters and P-value of
overall survival analysis for $J \in \{1, 2, \ldots, 20\}$. P-value is used to test the
Kaplan-Meier confidence.